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Abstract 

One of the key challenges in identifying nonlinear and possibly non- 
Gaussian state space models (SSMs) is the intractability of estimating the 
system state. Sequential Monte Carlo (SMC) methods, such as the parti¬ 
cle filter (introduced more than two decades ago), provide numerical solu¬ 
tions to the nonlinear state estimation problems arising in SSMs. When 
combined with additional identihcation techniques, these algorithms pro¬ 
vide solid solutions to the nonlinear system identification problem. We 
describe two general strategies for creating such combinations and discuss 
why SMC is a natural tool for implementing these strategies. 


Sequential Monte Carlo methods for 
system identification 


Thomas B. Schon, Fredrik Lindsten, Johan Dahlin, 
Johan Wagberg, Christian A. Naesseth, 
Andreas Svensson and Liang Dai* 

March 11, 2016 


Abstract 

One of the key challenges in identifying nonlinear and possibly non- 
Gaussian state space models (SSMs) is the intractability of estimating the 
system state. Sequential Monte Carlo (SMC) methods, such as the parti¬ 
cle filter (introduced more than two decades ago), provide numerical solu¬ 
tions to the nonlinear state estimation problems arising in SSMs. When 
combined with additional identihcation techniques, these algorithms pro¬ 
vide solid solutions to the nonlinear system identification problem. We 
describe two general strategies for creating such combinations and discuss 
why SMC is a natural tool for implementing these strategies. 


‘This work was supported by the projects Learning of complex dynamical systems (Con¬ 
tract number: 637-2014-466) and Probabilistic modeling of dynamical systems (Contract 
number: 621-2013-5524), both funded by the Swedish Research Council. TS, JW, AS 
and LD are with Division of Systems and Control, Uppsala University, Uppsala, Swe¬ 
den. E-mail: {thomas.schon, johan.wagberg, andreas.svensson, liang.daijSit.uu.se. 
FL is with the Department of Engineering, University of Cambridge, Cambridge, United 
Kingdom. E-mail: fredrik.lindstenaeng.cam.ac.uk. JD and CAN are with the 
Division of Automatic Control, Linkoping University, Linkoping, Sweden. E-mail: 
{Christian.a.naesseth,Johan.dahlinjaiiu.se. 



1 Introduction 


This paper is concerned with system identification of nonlinear state space mod¬ 
els (SSMs) in discrete time. The general model that we consider is given by, 


Xt+i\xt ^ f 0 {xt+i\xt,Ut), (la) 

yt\xtgeiyt\xt,ut), (lb) 

(6»-7r(6l)), (Ic) 


where the states, the known inputs and the observed measurements are denoted 
by Xt G X C , ut € U C K"" and yt Q R””, respectively. The dynamics 
and the measurements are modeled by the probability density functions (PDFs) 
fe{-) and ge{-), respectively, parameterised by the unknown vector 0 G 0 C 
The initial state xi is distributed according to some distribution gg{xi). For 
notational simplicity, we will from hereon (without loss of generality) drop the 
known input ut from the notation. When considering Bayesian identification, 
meaning that 9 is modelled as an unobserved random variable, we also place a 
prior 7r(0) on 9. We are concerned with off-line identification, i.e. we wish to 
find the unknown parameters 0 in Q based on a batch of T measurements. 

The key challenge that will drive the development throughout this paper is 
how to deal with the difficulty that the states xi-,t in Q are unknown. We will 
distinguish between two different strategies for handling this: 


1. Marginalisation amounts to marginalising (integrating out) the states 
from the problem and viewing 9 as the only unknown quantity of inter¬ 
est. In the frequentistic problem formulation, the prediction error method 
and direct maximisation of the likelihood belong to this 


Ljung 1999 


strategy. In the Bayesian formulation, the Metropolis-Bastings algorithm 
Metropolis et al. 1953 Hastings, 1970 can be used to approximate the 


2 . 


posterior distribution of the parameters conditionally on the data. 

Data augmentation treats the states as auxiliary variables that are 
estimated together with the parameters. The expectation maximisation 


(EM) algorithm of Dempster et al. 1977 solves the maximum likelihood 


formulation in this way and the Gibbs sampler of Geman and Geman 
|1984| solves the Bayesian problem by this strategy. 

In the special case when the model Q is linear and Gaussian, there are closed 
form expressions available from the Kalman filter and the associated smoother. 
The primary focus of this paper, however, is the more challenging nonlinear 
and/or non-Gaussian case. More than two decades ago [e.g., Gordon et al. 
1993[[Kitagawa 1993 sequential Monte Carlo (SMG) methods started to emerge 
with the introduction of the particle filter. These methods have since then 
undergone a rapid development and today they constitute a standard solution to 
the problem of computing the latent (i.e. unobserved/unknown/hidden) states 
in nonlinear/non-Gaussian SSMs. 






















The aim of this paper is to show how SMC can be used in solving the nonlin¬ 
ear system identification problems that arise in finding the unknown parameters 
in Q. We do not aim to cover all different methods that are available, but in¬ 
stead we aim to clearly describe exactly where and how the need for SMC arises 
and focus on the key principles. Complementary overview paper are provided 


by Kantas et al. 2014 and by Andrieu et al. 2004 


We consider the Bayesian and the maximum likelihood formulations, as de¬ 
fined in Sectio n The rest of the paper is divided into three parts. In the first 
part (Sectionsandwe describe the marginalisation and data augmentation 
strategies and show where the need for SMC arise. The second part (Section]^ 
provides a rather self-contained derivation of the particle filter and outlines 
some of its properties. Finally, in the third part (Section [o]^ we show how 
these particle methods can be used to implement the identification strategies 
described in the first part, resulting in several state-of-the-art algorithms for 
nonlinear system identification. Loosely speaking, the SMC-part of the various 
algorithms that we introduce is essentially a way of systematically exploring the 
state space in a nonlinear SSM Q in order to address the key challenge of 
dealing with the latent state sequence xi-t- 


2 Problem formulation 


There are different ways in which the system identification problem can be for¬ 
mulated. Two common formalisms are grounded in frequentistic and Bayesian 
statistics, respectively. We will treat both of these formulations in this paper, 
without making any individual ranking among them. First, we consider the fre¬ 
quentistic, or maximum likelihood (ML), formulation. This amounts to finding 
a point estimate of the unknown parameter for which the observed data is 
as likely as possible. This is done by maximising the data likelihood function 
according to 

6»ml = argmax peivi-.r) = argmax \ogpg{yi.,T)- (2) 

ees eee 


For a thorough treatment of the use of ML for system identification, see e.g., 
Ljung 1999 , Soderstrom and Stoica [1989] . 

Secondly, in the Bayesian formulation, the unknown parameters 9 are mod¬ 


eled as a random variable (or random vector) according to (Ic I. The system 


identification problem thus amounts to computing the posterior distribution of 
9 given the observed data. According to Bayes’ theorem, the posterior distri¬ 
bution is given by. 


p{S\yi-.T) = 


Pe{yi-T)T^{9) 

p{yi:T) 


(3) 


Note that the likelihood function should now be interpreted as the conditional 
PDF of the observations given the parameters 9, i.e. pe{yi-T) = p{yi:T\9). 
However, to be able to discuss the different identification criteria in a common 




















setting, we will, with a slight abuse of notation, denote the likelihood by pe^yi-.r) 
also in the Bayesian formulation. An early account of Bayesian system identifi- 


cation is provided by 

Peterka 

11981 

in Ninness and Henriksen 

2010|. 


The central object in both formulations above is the observed data likelihood 
PeiUi-.T), or its constituents pe{yt \ yi.t-i) as. 


peiyi-.T) = Wpe{yt I yi:t-i), 


(4) 


where we have used the convention yi|o — 0- For the nonlinear SSM 0, the 
likelihood Q is not available in closed form. This is a result of the fact that the 
latent state sequence Xi-t, i.e. p(xi-T \ yi-.r), is unknown. Indeed, a relationship 
between the likelihood and the latent states can be obtained via marginalization 
of the joint density pe{xi:T, Vi-.t) w.r.t. Xi-,t according to, 

Peiyi-.T) = J Pe{xi-.T,yi-.T)dxi.,T, (5) 

where the model provides a closed form expression for the integrand according 
to. 


T T-1 

Pe{xi-.T,yi:T) = ye{xi)Y\_9eiyt \ xt) fe{xt+i \ xt). ( 6 ) 

t=i t=i 

This expression does not involve any marginalisation, whereas the observed data 
likelihood peiyi-.r) is found by averaging the joint distribution pe{xi-,T,yi-.T) 
over all possible state sequences according to ([^. Equivalently, we can express 
Peiyi-.r) as in @ where the one-step predictive likelihood can be written (using 
marginalisation) as, 

ps{yt\yi-.t-i) = J ge{yt\xt)pe{xt\yi:t-i)<ixt. (7) 

These expressions highlight the tight relationship between the system identifi¬ 
cation problem and the state inference problem. A key challenge that will drive 
the developments in this work is how to deal with the latent states. For nonlin¬ 
ear system identification, the need for computational methods, such as SMC, is 
tightly coupled to the intractability of the integrals in (§ and 0 . 

To illustrate the strategies and algorithms introduced we will use them to 
solve two concrete problems, which are formulated below. 












I-Example 1: Linear Gaussian model- 1 

Our first illustrative example is a simple linear Gaussian state space (LGSS) 
model, given by 

xt+i = {).7xt+vt, ut ~ (8a) 

j/t = 0.5xt + et, et ~ A/'(0,0.1), (8b) 

(6»~ 0(0.01,0.01)), (8c) 

where the unknown parameter 9 corresponds to the precision (inverse variance) 
of the process noise Vt ■ Note that the prior for the Bayesian model is chosen as 
the Gamma (Q) distribution with known parameters, for reasons of simplicity, 
since this is the conjugate prior for this model. The initial distribution fj,g(xi) 
is chosen as the stationary distribution of the state process. Identification of 6 
is based on a simulated data set consisting of T = 100 samples j/moo with true 
parameter Oq = 1. 

I_I 


-Example 2: Nonlinear non-Gaussian model- 

Our second example, involving real data, is related to a problem in paleocli- 
matology. Shumway and Stoffer] [2011 considered the problem of modelling 
the thickness of ice varves (layers of sediments that are deposited from melting 
glaciers). The silt and sand that is deposited over one year makes up one varve 
and changes in the varve thicknesses indicates temperature changes. The data 
set that is used contains the thickness of 634 ice varves formed at a location in 
Massachusetts between years 9 883 and 9 250 BC. We make use of a nonlinear 
and non-Gaussian SSM proposed by Langrock [2011 to model this data: 


Xt+l\xt JV{xt+l-,4’Xt,T ^), 

yt\xt^ 0(j/t; 6.25,0.256 exp(-a;t)). 


(9a) 

(9b) 


with parameters 9 = {^,r}. The initial distribution fj,g{xi) is chosen as the 
stationary distribution of the state process. The data set and a more complete 
description of the problem is provided by Shumway and Stoffer [2011 . 


3 Identification strategy — marginalisation 

The marginalisation strategy amounts to solving the identihcation problem— 
either ([^ or ([^ —by computing the integral appearing in Q (or, equivalently, 
(§)• That is, we marginalize (integrate out) the latent states Xut and view 9 
as the only unknown quantity of interest. 

In some special cases the marginalisation can be done exactly. In partic¬ 
ular, for LGSS models the one-step predictive density in Q is Gaussian and 
computable by running a Kalman filter. For the general case, however, some 
numerical approximation of the integral in Q is needed. We will elaborate on 













this in Section where we investigate the possibility of using SMC to perform 
the marginalisation. For now, however, to illustrate the general marginalisation 
strategy, we will assume that the integral in Q can be solved exactly. 


3.1 ML identification via direct optimisation 

Consider first the ML formulation ([^. Direct optimisation (DO) amounts to 
working directly on the problem 


T 

^ML = argmax T^log / geivt \ xt)pe{xt \ yi-.t-i)dxt, (10) 

0G0 ^ J 

where we have rewritten the data log-likelihood using 0 and 0 . Even though 
we momentarily neglect the difficulty in computing the integral above it is typ¬ 
ically not possible to solve the optimisation problem (101 in closed form. In¬ 


stead, we have to resort to numerical optimisation methods, see e.g. |Nocedal and| 
Wright |2006| . These methods typically find a maximiser of the log-likelihood 
function logpg{yi-,T) by iteratively refining an estimate 6k of the maximiser 0 ml 
according to 


6k-\-i — 6k -f OfcSfc. 


( 11 ) 


Here, Sk denotes the search direction which is computed based on information 
about the cost function available from previous iterations. The step size afc, 
tells us how far we should go in the search direction. The search direction is 
typically computed according to 

Sk=H~^gk, 5fc = Velogpe(2/i:T)|g^g^, (12) 

where Hk denotes a positive definite matrix (e.g. the Hessian Vg logpe(?/i:T) or 
approximations thereof) adjusting the gradient gk- 


-Example 3: DO applied to the LGSS model- 


To apply the update in (101 for estimating 6 in we need to determine search 


direction Sk and the step lengths Ofc. Here, we select the search direction as the 
gradient of the log-likelihood, i.e. Hk is the identity matrix. The log-likelihood 
for (Isl) can be expressed as 


logpe( 2 /i:T) = '^\ogJ\f {yu0.5xt\t-i, Pt\t-i +0.1), 


(13) 


t=2 


where Xt\t_i and denotes the predicted state estimate and its covariance 

obtained from a Kalman filter. The gradient gk in (121 of the log-likelihood 
(13) can therefore be obtained by calculating VgXt\t_i and VgPt\t-i, which can 


be obtained f rom the Kalma n filter, using the so-called sensitivity derivatives 
introduced by Astrom 1980 . In the upper part of Figure]^ we present the log- 
likelihood (blue) computed using the Kalman filter together with the estimate 
^ML (orange) obtained by the DO algorithm. 























precision 0 


precision 0 


Figure 1: Upper: the log-likelihood estimates (green) together with the ML parameter 
estimates of the LGSS model obtain by the DO and the EM algorithms, respectively. 
The estimates sit on top of each other and are shown in blue. Lower: parameter 
posterior estimates obtained by the MH (left) and the Gibbs (right) algorithms, re¬ 
spectively. The vertical dashed lines indicate the true parameters of the model from 
which the data is generated and the dark grey lines indicate the prior density. 

















Within the DO method the use of SMC arise in evaluation of the cost func¬ 
tion in (10) and its derivatives to compute the search directions Sk in (12). 


3.2 Bayesian identification via Metropolis—Hastings 

Let us now turn to the Bayesian formulation ([^. As above, to illustrate the 
idea we consider first the simple case in which the marginalisation in ([^ can be 
done exactly. Still, in most nontrivial cases the posterior PDF in ([^ cannot be 
computed in closed form. The difficulty comes from the factor p{yi-T)^ known 
as the marginal likelihood, which ensures that the posterior PDF is properly 
normalised (i.e., that it integrates to one). The marginal likelihood can be 
expressed as, 

p{yi-.T) = J P9{yi-.T)'!T{O)d0. ( 14 ) 


Even if the likelihood pe{yi-T) is analytically tractable, we see that we need to 
carry out an integration w.r.t. 9. Furthermore, computing a point estimate, say 
the posterior mean of 9, also amounts to solving an integral 

^[9\yi-.T] = J 9p{9\yi,T)d9, (15) 

which may also be intractable. 

A generic way of approximating such intractable integrals, in particular those 
related to Bayesian posterior distributions, is to use a Monte Carlo method. In 
Section we will discuss, in detail, how SMC can be used to approximately 
integrate over the latents states of the system. The sequential nature of SMC 
makes it particularly well suited for integrating over latent stochastic processes, 
such as the states in an SSM. However, to tackle the present problem of inte¬ 
grating over the latent parameters we shall consider a different class of methods 
denoted as Markov chain Monte Carlo (MCMC). 

The idea behind MCMC is to simulate a Markov chain {9[m]}m>i- The 
chain is constructed in such a way that its stationary distribution coincides 
with the so-called target distribution of interest, here p{9 \ yi-.r)- If, in addition, 
the chain is ergodic —essentially meaning that spatial averages coincide with 
“time” averages—the sample path from the chain can be used to approximate 
expectations w.r.t. to the target distribution: 


1 

M -k + 1 


M 




ip{9)p{9 I yi:T)d9, 


(16) 


as M —>■ 00 , for some test function (p. Here —4 denotes almost sure convergence. 
Note that the first k samples have been neglected in the estimator (16), to avoid 
the transient phase of the chain. This is commonly referred to as the burn-in 
of the Markov chain. 

Clearly, for this strategy to be useful we need a systematic way of construct¬ 
ing the Markov chain with the desired properties. One such way is to use the 






Metropolis-Hastings (MH) algorithm. The MH algorithm uses an accept/reject 
strategy. At iteration m + 1, we propose a new sample O' according to, 


O' ~ q{-\0[m]), 


(17) 


where q{-) denotes a proposal distribution designed by the user and 0[m\ denotes 
the sample from the previous iteration m. The newly proposed sample O' will 
then be added to the sequence (i.e. 0[m + 1] = O') with probability 


a = 1 A ^ I O') 


= 1 A 


PiO[m] I q{0'\0[m]) 
Pe'iyi:T)Tr{0') q{0[m] 


O') 


Pd[m]iyi:T)TriO[m]) q{0'\0[m])’ 


(18a) 

(18b) 


where a A 6 is used to denote min (a, b) and a is referred to as the acceptance 
probability. Hence, with probability 1 — a the newly proposed sample is not 
added to the sequence (i.e. rejected) and in that case the previous sample is 
once more added to the sequence 0[m + \] = 0[m]. 

There exists a well established theory for the MH algorithm, which for ex¬ 
ample establish that it is ergodic and converges to the correct stationary dis¬ 
tribution. This has been developed since the algorithm was first introduced 


by Metropolis et al. 1953 and Hastings 1970 . Ninness and Henriksen 2010 


provides a nice account on the use of the MH algorithm for Bayesian system 
identification. 

-Example 4: MH applied to the LGSS model- 


To apply the MH algorithm for parameter inference in the LGSS model, we 
require a proposal distribution q{-) in ([TtI) and to calculate the acceptance prob¬ 
ability a in (18b). A standard choice for q{-) is a Gaussian random walk 


q{0' I 0[m]) = A/” (d' I d[m], cr, 


(19) 


where denotes the variance of the random walk. For this choice of proposal, 
the acceptance probability is 


a = 1 A 


Pe'iyi:T)Tr{0') 

Peini]{yi-.T)Tr{0[m])’ 


where q cancels since it is symmetric in 0. Note that the likelihood can be 


computed using the Kalman filter in analogue with (13) for the LGSS model. 


In the lower left part of Figure we present the resulting posterior estimate 
obtained from running the algorithm M = 20 000 iterations (discarding the first 
10 000 as burn-in). 




















4 Identification strategy — Data augmentation 

An alternative strategy to marginalisation is to make use of data augmentation. 
Intnitively, we can think of this strategy as a systematic way of separating one 
hard problem into two new and closely linked sub-problems, each of which is 
hopefully easier to solve than the original problem. For the SSM 0 these two 
problems amounts to 

1. finding information about the state sequence 

2. finding information about the parameters 9. 

The state sequence xi-,t is thus treated as an auxiliary variable that is esti¬ 
mated together with the parameters 9. Using the data augmentation termi¬ 
nology Tanner and Wong 1987 van Dyk and Meng 2001 , the state sequence 


Xi-T is referred to as the missing data, as opposed to the observed data yi-.T- By 
augmenting the latter with the former, we obtain the complete data {xi-,t, Vi-.t}- 
Naturally, if the complete data were known, then identification of 9 would 
have been much simpler. In particular, we can directly write down the complete 
data likelihood pe{xi-,T,yi-.T) according to Q, which contrary to the observed 
data likelihood pg{yi,T), does not involve any marginalisation. The two like¬ 
lihoods are related via ([^, suggesting that the complete data likelihood can 
indeed be used for identifying the unknown parameters. 

4.1 Expectation Maximisation (EM) 

Analogously with the marginalisation strategy, we can make use of data augmen¬ 
tation to address both the frequentistic and the Bayesian identification prob¬ 
lems. For the former identific ation criteria, ([^, the result is the expectation 
maximisation (EM) algorithm Dempster et al. 1977| . EM provides an iterative 
procedure to compute ML estimates of unknown parameters 9 in probabilistic 
models involving latent variables, like the SSM in 0. 

As a result of the conditional probability identity 

Pe{xi:T, yi-.r) = Pe{xi:T \ yi:T)P 0 {yi-.T), 

we can relate the observed and complete data log-likelihoods as 

logpe(yi:T) = logpe(a;i:T,yi:T) - logpe(a;i:T I (20) 

The EM algorithms operates by iteratively maximising the intermediate quantity 

Q{e,9') = J \ogpg{xi.,T,yi:T)pe'{xi:T\yi:T)dxi,T (21a) 

= Eg/ [\ogpe{xi.,T,yi:T) Ivi-.t] , (21b) 


according to: 














(E) Q(0, e[k]) = Egffc] [\ogpe{xx,T, yi-.r) \ Vi-.t], 

(M) 0[fc + 1] = argmax Q(0, 0[/c]). 
flee 

We can show that iterating the above Expectation (E) and Maximisation (M) 
steps implies 


Q{e,e') > Q{e',9') Pe{yi:T)>P0'{yi:T). (22) 


Hence, will by construction result in a monotonic increase in likelihood 

values. Hence, the complete data log-likelihood logpe(a^i:T, 2 / 1 :t) can, via the 
intermediate quantity Q in (Hj, be used as a surrogate for the original observed 
data likelihood function pe(yi:T) in solving the ML problem ([^. We are still 
required to compute the integral (211 and an important observation is now that 
we can approximate this integral (and its derivatives w.r.t. 9) using SMC. 


-Example 5: EM applied to the LGSS model- 

We need to compute the intermediate quantity to apply the EM algorithm for 
estimating the parameter in the LGSS model. For this model, we can write 


Q{9,9') = const. -I- Eg/ 


\ogpe{xi) + '^\ogAf{xt;0.7xt-i,9 


t=2 


yi-.T 


Note, that this expression results from that the parameter only is present in the 
latent state process. 

We can directly maximise the intermediate quantity for 9 since the system 
is linear in the parameters. By taking the gradient of Q{9,9'), we obtain terms 
proportional to Ee'[xt-iixt - 0.7xt_i) | , where Xt\TXt\T and Xt\TXt-i\T 

denotes smoothed state estimates and Pt^t\T and Pt-i^t\T their covariances, re¬ 
spectively. These can be computed using a Kalman smoother and we refer the 
reader to Gibson and Ninness 2005| for the explicit expressions for implement¬ 
ing this. In the upper part of Figure we present the parameter estimate ^ml 
( blue) obtained by the EM algorithm. We note that the parameter estimates 
obtained by the DO and EM algorithms are identical and overlapping. However, 
they differ from the true parameter due to the finite value of T. 


4.2 Gibbs sampler 

The Gibbs sampler is an MGMG method that produce samples from the joint 
distribution by alternatively sampling from its conditionals. Let us consider 
the Bayesian formulation, with the aim of computing (|^. Inspired by the data 
augmentation strategy, start by assuming that the complete data {ccu, yi-.r} is 
available. Bayes’ theorem then results in 

Peixv.T,yi-.T)'x{9) 
p{xi:T,yi-.T) 


p{9 I Xi..T,yi-.T) 


(23) 















Intuitively, if the states xi:t were known, then the identification problem would 


be much simpler and, indeed, computing the posterior in (231 is typically easier 
than in ([^. Firstly, the complete data likelihood is provided in (|^, whereas the 
likelihood peiyi-.r) is intractable in the general case. Secondly, in many cases of 
interest it is possible to identify a prior for 0 that is conjugate to the complete 


data likelihood, in which case the posterior (231 is available in closed form. 


The problem with (23) is of course that it hinges upon the state sequence 


being known. However, assume that we can simulate a sample of the state tra¬ 
jectory Xi-,T from its conditional distribution given the observed data yi:T and 
the system parameters 9, i.e. from the joint smoothing distribution. Further¬ 


more, assume that it is possible to sample from the distribution in (23). We can 
then implement the following algorithm: Initialise 0[O] G 0 arbitrarily and, for 
TO > 0, 


Sample xi:T[m] ~ Pe[m]{xi:T \ yi-.r)- (24a) 

Sample 9[m -I- I] ~ p{9 \ xi-,T[Tn\,yi,T)- (24b) 

This results in the generation of the following sequence of random variables 

0 [O],Xi,T [0], 0 [1], Xi:T [1], 0 [2], Xi,T [2],..., (25) 

which forms a mutual Markov chain in the parameters 6 and the states ccu. 


{9[m],Xi-,T[mWm>i- The procedure in (24) represents a valid MCMC method. 
More specifically, it is a particular instance of the so-called Gibbs sampler. The 
simulated Markov chain admits the joint distribution p{9,xi:T \ yi-.r) as a sta¬ 
tionary distribution. Furthermore, under weak conditions it can be can be shown 


to be ergodic. That is, the Markov chain generated by the procedure (24) can 


be used to estimate posterior expectations, and the estimators are consistent in 
the sense of Note that, if we are only interested in the marginal distri¬ 

bution p{9 I j/i:t), then it is sufficient to store the sub-sequence constituted by 
{9[m]}m>i, obtained by simply discarding the samples {a::i:T[TO]}m>i from (25). 


It is worth pointing out that it is possible to combine Gibbs sampling with 
other MCMC methods. For instance, if it is not possible to sample from poste¬ 


rior distribution (23) exactly, then a valid approach is to replace step (24b) of the 


Gibbs sampler with, e.g., an MH step with target distribution p{9 \ xu, J/1:t)- 
Similarly, for nonlinear state space models, the joint smoothing distribution in 


(24a) is not available in closed form, but it is still possible to implement the 


strategy above by sampling the state trajectory from an MCMC kernel targeting 


the joint smoothing distribution; see Section 8.2 


-Example 6: Gibbs applied to the LGSS model- 

To implement the Gibbs sampler for parameter inference in the LGSS model, we 


need to sample from the conditional distributions in (24). To generate samples 


of state trajectories given 9 and yi-.T, we can make use of the factorisation 
















of the joint smoothing distribution. Consequently, we can sample using 

the following backward simulation strategy: Sample xt ~ Pe{xT \ Vi-.t) and, for 
t = T — 1, ..., 1, sample 

Xt ~ pe{xt I xt+i, yi,t) oc pe{xt+i \ xt)pe{xt \ yi,t)- (27) 


We see that the backward simulation relies on the filtering distributions denoted 
by {peixt I yi-.t)}J=i and, indeed, for the LGSS model we can obtain closed form 
expressions for all the involved densities by running a Kalman filter. 

In the second step, (24b), we sample the parameters 9 by 


9 ^ p{9\ Xl:T, 

yi-.r) =G{9\a,/3), 

(28a) 

T 

<y = 0.01 -|- — 

5 

(28b) 

/3 = 0.01+^( 

r T-i . 


0.51x1 -1- ^ {xt+i - 0.7a;t)^ j , 

(28c) 


which is the result of a standard prior-posterior update with a Gamma prior 
and Gaussian likelihood. The closed-form expression for p{9\xi-,T,yi:T) is an 
effect of using a prior which is conjugate to the complete data likelihood. 

In the lower right part of Figure [l] we present the resulting posterior esti¬ 
mate obtained from the Gibbs sampler with the same settings as for the MH 
algorithm. We note that the posterior estimates are almost identical for the two 
methods, which corresponds well to the established theory. 

I_I 


The data augmentation strategy (here implemented via the Gibbs sampler) 
enabled us to approximate the posterior distribution p{9 \ yi-x) by separating 
the problem into two connected sub-problems (24). 


5 Sequential Monte Carlo 

Sequential Monte Carlo (SMC) methods offer numerical approximations to the 
state estimation problems associated with the nonlinear/non-Gaussian SSM 0- 
The particle filter (PF) approximates the filtering PDF pe{xt\yi:t) and the 
particle smoother (PS) approximates the joint smoothing PDF pe{xi:t \ yi-.t), or 
some of its marginals. The PF can intuitively be thought of as the equivalent 
of the Kalman filter for nonlinear/non-Gaussian SSMs. 

The SMC approximation is an empirical distribution of the form 

N 

Pe{xt\yi:t) =Y2wlS^i^{xt). (29) 

The samples {xl}^i are often referred to as particles —they are point-masses 
“spread out” in the state space, each particle representing one hypothesis about 





the state of the system. We can think of each particle x\ as one possible system 
state and the corresponding weight w\ contains information about how probable 
that particular state is. 

To make the connection to the marginalisation and data augmentation strate¬ 
gies introduced in the two previous sections clear, we remind ourselves where 
the need for SMC arise in implementing these strategies to identify 0 in Q. 
The PF is used to compute the cost function (101 and its derivatives in order 


to find the search directions (12). To set up an MH samler, we can make use 


of a likelihood estimate provided by the PF in order to compute the acceptance 
probabilities in When it comes to data augmentation strategies, the PS 


is used to approximate the intermediate quantity in (21) and in order to set up 


the Gibbs sampler, particle methods are used to draw a realisation from the 


joint smoothing distribution in (24aI. 


5.1 Particle filter 

A principled solution to the nonlinear filtering problem is provided by the fol¬ 
lowing two recursive equations: 


Peixt \yi:t) = 


geivt I xt)pg{xt I 


Pe{yt\yi:t-i) 


(30a) 

(30b) 


These equations can only be solved in closed form for very specific special cases, 
e.g., the LGSS model which results in the Kalman filter. We will derive the 


particle filter as a general approximation of (30) for general nonlinear/non- 
Gaussian SSMs. 

The particle filter—at least in its most basic form—can be interpreted as a 
sequential application of importance sampling. At each time step t we use impor¬ 
tance sampling to approximate the filtering PDF pg{xt \ yi-.t)- This is made pos¬ 
sible by using the expressions in (30) and by exploiting the already generated im¬ 
portance sampling approximation oipe(xt-i \ yi-.t-i). At time t = 1 we can find 
an empirical distribution (29) by approximating pg(xi \ yi) oc ge{yi \ xi)pig{xi) 


using importance sampling in the normal sense. We sample independently the 
particles from some proposal distribution rg{xi). To account for the 

discrepancy between the proposal distribution and the target distribution, the 
particles are assigned importance weights, given by the ratio between the tar¬ 
get and the proposal (up to proportionality), i.e. w\ oc gg[yi \ x\)yLg(x\)/rg{x\), 
where the weights are normalised to sum to one. 

We proceed in an inductive fashion and assume that we have an empirical 
approximation of the filtering distribution at time t — 1 according to 


N 


pg[xt-i\yi-.t-i) = ^wj_i4i_^(a:t_i). 


(31) 











Inserting this into (30b I results in 


Pe 


i—1 

N 

= J2'^l-ife(.xt\xl_i). (32) 


That is, we obtain a mixture distribution approximating P 0 {xt \ yi:t-i), where 
we have one mixture component for each of the N particles at time t — 1. 
Furthermore, inserting (32) into (30a) results in the following approximation of 
the filtering PDF 


Pe{xt\yi-.t) 


9e{yt\xt) 
Pe{yt 


N 


yi-.t-i) ^ 


wl_Jg{xt\xl_^). 


(33) 


The task is now to approximate (33) using importance sampling. Inspired by 


the structure of (33) we choose a proposal density (again denoted by rg) of the 


same form, namely as a mixture distribution. 


N 


reixtlyi-.t) = '^vl_irg{xt\xl_^,yt), 


(34) 


i=l 


where both the mixture components rg{xt \xt-i,yt) and the mixture weights 
vt-i are design choices constituting parts of the proposal distribution. 


To generate a sample from the mixture distribution (34) the following two- 
step procedure is used; first we randomly select one of the components, and 
then we generate a sample from that particular component. Note that we will 
sample N particles from the proposal distribution ( |M| . Let us use aj to denote 
the index of the mixture component selected for the particle at time t. Now, 
since the probability of selecting component rg{xt \xl_i^yt) is encoded by its 
weight vl_^, we have that 


' (a* = i) = 


j = 1, ..., iV. 


(35) 


Subsequently, we can generate a sample from the selected component a\ accord¬ 
ing to x\ ^ rgixt \ x\_i,yt), where x\_i = x'^ti- By construction xl is now a 


sample from the proposal density (34). The particle xl_i is referred to as the 
ancestor particle of x]., since xl is generated conditionally on xl_i. This also 
explains why the index a) is commonly referred to as the ancestor index, since 
it indexes the ancestor of particle x) at time t — 


In practice we sample the N ancestor indices according to (35) in one 


go. This results in a new set of particles {x\-i\p^i that are subsequently used 
to propagate the particles to time t. This procedure, which (randomly) gener¬ 
ates by selection (sampling with replacement) from among {xl_i}f^i 

according to some weights, is commonly referred to as resampling. 











The next step is to assign importance weights to the new particles accounting 
for the discrepancy between the target distribution pe{xt \ yi-t) and the proposal 
distribution r0{xt \ yi-.t)- As before, the weights are computed as the ratio be¬ 
tween the (unnormalised) target PDF and the proposal PDF. Direct use of (331 


and (34) results in 


geivt I xi) wi_Jeixi \ xl_^) 


w, = 




(36) 




By evaluating wl for i = 1, ..., N and normalising the weights, we obtain a new 
set of weighted particles {xl,wl}^i, constituting an empirical approximation 
of pe{xt \yi-.t)- This completes the algorithm, since these weighted particles in 
turn can be used to approximate the filtering PDF at time t + 1, then at time 
t + 2 and so on. 

A problem with the algorithm presented above is that the weight calculation 


in (36) has a computational complexity of 0{N) for each particle, rendering 
an overall computational complexity of 0{N'^), since the weights need to be 
computed for all N particles. A pragmatic solution to this problem is to use the 
freedom available in the proposal density and select it according to rg{xt \ yi.t) = 
w{_ife{xt I xi_i). That is, we select ancestor particles with probabilities 
given by their importance weights and sample new particles by simulating the 


system dynamics from time t—1 tot. Inserting this into (36) results in the simple 


expression wl = ge{yt \x\), which brings the overall computational complexity 
down to 0{N). The resulting algorithm is referred to as the bootstrap particle 
filter and it is summarised in Algorithm 

The bootstrap PF was the first working particle filter, an early and influential 
derivation is provided by Gordon et al.| [1993 . It is arguably the simplest 
possible implementation of SMC, but nevertheless, it incorporates the essential 
methodological ideas that underpin the general SMC framework. Importance 
sampling (i.e. propagation and weighting in Algorithm]^ above) and resampling 
are used to sequentially approximate a sequence of probability distributions of 
interest; here {pe{xt \ yi-.t)}t>i- 

Selecting the dynamics as proposal distribution, as in the bootstrap particle 
filter, is appealing due to the simplicity of the resulting algorithm. However, 


Algorithm 1 Bootstrap particle filter (all operations are for i = 1 , 


1: 

Initialisation (t = 1): 


2: 

Sample x\ ~ gg{xi). 


3: 

Compute wl = gg{yi | ®i), normalise, wl = wl/J2^-i 


4: 

for t = 2 to T do 


5: 

Resampling: Sample a] with P (aj = j) = wfi 


6: 

Propagation: Sample x] ~ fe{xt \ ®tl.i)- 


7: 

Weighting: Compute wl = ge{yt \ xl) and normalise, 
wl = wi/J2f=iwi. 


8: 

end 













this choice is unfortunately also suboptimal, since the current measurement yt is 
not taken into account when simulating the particles {xl}fLi from the proposal 
distribution. A better strategy of reducing the computational complexity of the 
weight computation from 0{N'^) to 0{N) is to target the joint distribution of 
{xt,at) with an importance sampler, instead of directly targeting the marginal 
distribution of Xt as was done above. Indeed, by explicitly introducing the 
ancestor indices as auxiliary variables in the importance sampler, we obtain the 
weight expression 


WtLi99{yt\x\)f9{x\\x'il^) 

VtUre{x\\x‘il^,yt) 


(37) 


as a more practical alternative to ( |36| ). With this approach we have the pos¬ 
sibility of freely selecting the mixture weights vt-i and mixture components 
r 0 (xt \xt-i,yt) of the proposal, while still enjoying an overall linear computa¬ 
tional complexity. The resulting algorithm is referred to as the auxiliary particle 
filter (APF). Rather than providing the details of the derivation we simply refer 

or our complete derivation in 


to the original paper by 

Pitt and Shephard 

1999 

Schon and Lindsten 

2015 . 


5.2 Some useful properties of the PF 

As any Monte Carlo algorithm, the PF can be interpreted as a random num¬ 
ber generator. Indeed, the particles and the ancestor indices used within the 
algorithm are random variables, and executing the algorithm corresponds to 
simulating a realisation of these variables. It can be useful, both for under¬ 
standing the properties of the PF and for developing more advanced algorithms 
around SMC, to make this more explicit. Let 

Xt A {xl, ..., a;f}, and a* = {a\, ..., af}, (38) 


refer to all the particles and ancestor indices, respectively, generated by the PF 
at time t. The PF in Algorithm then generates a single realisation of a collec¬ 
tion of random variables a.2-.T} € x {I, ..., Furthermore, 

since we know how these variables are generated, we can directly write down 
their joint pdfQ as, 

N TfN . 'I 

V'e(xi:T,a 2 :T |yi:T) = Wy9{x\) /s(xj | ) \ . (39) 

i=l t=2 U=1 J 


Naturally, any estimator derived from the PF will also be a random variable. 
From (39) we note that the distribution of this random variable will depend 
on the number of particles N, and convergence of the algorithm can be iden¬ 
tified with convergence, in some sense, of this random variable. Specifically, 
let : X I—>■ M be some test function of interest. The posterior expectation 


^w.r.t. to a natural product of Lebesgue and counting measure. 
















Eg [ip{xt) I yi,t\ = J tp{xt)pe{xt \ yi:t)Axt, can be estimated by the PF by com¬ 
puting (cf., 


N 




= ^wlv{x\)- 


(40) 


There is a solid theoretical foundation for SMC, e.g., investigating the conver¬ 
gence of (401 to the true expectation as N ^ oo and establishing non-asymptotic 
bounds on the approximation error. The (types of) existing theoretical results 
are too numerous to be mentioned here, and we refer to the book by [Del Moral] 
for a comprehensive treatment. However, to give a flavour of the type 


2004 


of results that can be obtained we state a central limit theorem (CLT) for the 
estimator (401. Under weak regularity assumptions it holds that |Del Moral and 
[200^|Del Moral| [20041 [Chopii^ [2004| 


ViV (^f - Ee [ip{xt) I yi-.t]) 


(41) 


as —>■ oo where —> denotes convergence in distribution. The asymptotic 
estimator variance cr^(ip) depends on the test function tp, the specific PF imple¬ 
mentation that is used and, importantly, the properties of the state space model 


(an explicit expression for cr^(<p) is given by Doucet and Johansen 2011 ). 


The CLT in (41) is reassuring since it reveals that the estimator converges at 
a rate '/N, which is the same rate as for independent and identically distributed 
(i.i.d.) Monte Carlo estimators. An interesting question to ask, however, is how 


the asymptotic variance depends on t. In particular, recall from (33) that we 


use the approximation of the filtering distribution at time t — 1, in order to 
construct the target distribution, which in turn is approximated by the particles 
at time t. This “approximation of an approximation” interpretation of the PF 
may, rightfully, lead to doubts about the stability of the approximation. In 
other words, will the asymptotic variance grow exponentially with tl 

Fortunately, in many realistic scenarios, the answer to this question is no. 
The key to this result is that the model exhibits some type of forgetting, essen¬ 
tially meaning that the dependence between states Xs and Xt diminishes (fast 
enough) as \t — s\ gets large. If this is the case, we can bound af{(p) < C for 
some constant C which is independent of t, ensuring the stability of the PF 
approximation. We refer to Del Moral and Guionnet 2001 , Chopin [2004 for 
more precise results in this direction. 

In analogy with the Kalman filter, the PF does not only compute the filtering 
distribution, but it also provides (an approximation of) the likelihood ps{yi-.t), 
which is central to the system identification problem. For the bootstrap PF in 
Algorithm this is given by. 


N 




S=1 


(42) 


Note that the approximation is computed using the unnormalised importance 
weights The expression (42) can be understood by considering the 































factorisation Q and noting that the one-step predictive likelihood, by Q, can 
be approximated by, 


pe{ys\yi-.s-i) = j ge{ys\xs)p 0 {xs\yi,s-i)^xs 


N 


N 




i=l 


N 


i=l 


where {a;s}i=i simulated from the bootstrap proposal given by re(xs | yi-.s) = 
Pe(xs I yi-.s-i) (a similar likelihood estimator can be defined also for the general 
APF). 


Sharp convergence results are available also for the likelihood estimator (42). 

i.e. ^eiyi-.t)] = Peivi-.t) for any value 


First of all, the estimator is unbiased. 
of N, where the expectation is w.r.t. the randomness of the PF [Pitt et ah 2012 


Del Moral 2004 . We will make use of this result in the sequel. Furthermore, the 
estimator is convergent as IV —>■ oo. In particular, under similar regularity and 
forgetting conditions as mentioned above, it is possible to establish a CLT at 


rate y/N also for (42). Furthermore, the asymptotic variance for the normalised 
likelihood estimator can be bounded hy D ■ t for some constant D. Hence, in 


contrast with the filter estimator (40), the asymptotic variance for (42) will grow 


with t, albeit only linearly. However, the growth can be controlled by selecting 
N (X t, which provides a useful insight into the tuning of the algorithm if it is 
to be used for likelihood estimation. 


5.3 Particle smoother 


The PF was derived as a means of approximating the sequence of filtering den¬ 
sities {pg{xt 12/i:t)}t>i- We can also start from the forward smoothing relation 


P0{xi:t I yi-.t) 


Peixi.t-i I yi-.t-i) 


fg{xt I Xt-i)gg(yt I Xt) 

Pe{yt \ yi-.t-i) 


(43) 


and derive the particle filter as a means of approximating the sequence of joint 
smoothing densities {pe{xi.,t \ yi:t)}t>i- Interestingly, the resulting algorithm is 
equivalent to the PF that we have already seen. Indeed, by using the ances¬ 
tor indices we can trace the genealogy of the filter particles to get full state 
trajectories, resulting in the approximation 


N 

Pe{xi-.t\yi-.t) =^wl6^i^Jxi:t). (44) 

i=l 


However, there is a serious limitation in using the PF as a solution to the 
smoothing problem, known as path degeneracy. It arises due to the fact that 
the resampling step, by construction, will remove particles with small weights 
and duplicate particles with high weight. Hence, each resampling step will 
typically reduce the number of unique particles. An inevitable results of this is 















that for any given time s there exists t > s such that the PF approximation of 
Peixi:t I yi-.t) collapses to a single particle at time s. 

One solution to the path degeneracy problem is to propagate information 
backwards in time, using a forward/backward smoothing technique. The joint 
smoothing distribution can be factorised as in (261 where each factor depends 
only on the filtering distribution (cf. (I?])). Since the filter can be approximated 
without (directly) suffering from path degeneracy, this opens up for a solution to 
the path degeneracy problem. An important step in this direction was provided 
by Godsill et al. [2004 , who made use of backward simulation to simulate com¬ 
plete state trajectories xu, approximately distributed according to the joint 
smoothing distribution peixi:T Ivi-.t)- The idea has since then been refined, 
see e.g. Done et al. [2011] , Bunch and Godsill 2013 . Algorithms based on the 
combination of MGMC and SMC introduced by Andrieu et al. 2010 , resulting 


in the particle MGMC (PMCMC) methods, also offer promising solutions to the 
nonlinear state smoothing problem. For a self-contained introduction to particle 
Lindsten and Schon [2013 . 


smoothers, see 


6 Marginalisation in the nonlinear SSM 

Now that we have seen how SMC can be used to approximate the filtering distri¬ 
bution, as well as the predictive and smoothing distributions and the likelihood, 
we are in the position of applying the general identification strategies outlined 
in the previous sections to identify nonlinear/non-Gaussian state space models. 


6.1 Direct optimisation using Fisher’s identity 

Consider the maximum likelihood problem in ([^. The objective function, i.e. 
the log-likelihood, can be approximated by SMC by using ( [4^ . However, many 
standard optimisation methods requires not only evaluation of the cost function, 
but also the gradient and possibly the Hessian, in solving ([^. SMC can be used 
to compute the gradient via the use of Fisher’s identity, 


Ve logpe(yi:T)L „ =VeQ(6», 6»fc) 




(45) 


where the intermediate quantity Q was defined in (21). It follows that 


Ve logPe(l/i:T) — Eg [Vg logpg(Xi:T, yi:T) I Ui-.t] ■ 


(46) 


That is, the gradient of the log-likelihood can be computed by solving a smooth¬ 
ing problem. This opens up for gradient approximations via a particle smoother, 
as discussed in Section 5.3 see e.g. Poyiadjis et al. 2011 for further details. 


The Hessian can also be approximated using, for example, Louis’ identity [e.g.. 


Cappe et al. 2005 


Note that the gradient computed in this way will be stochastic, since it is 
approximated by an SMC method. It is therefore common to choose a dimin¬ 
ishing step-size sequence of the gradient ascent method according to standard 
































stochastic approximation rules; see e.g., Kushner and Yin |1997 , Benveniste 


et al. 1990 . However, it should be noted that the approximation of the gradi¬ 


ent of the log-likelihood will be biased for a finite number of particles N, and 
the identification method therefore relies on asymptotics in N for convergence 
to a maximiser of 


6.2 Using unbiased likelihoods within MH 


We can make use of the likelihood estimator (42) also for Bayesian identification 


of nonlinear SSMs via the MH algorithm. Indeed, an intuitive idea is to simply 


replace the intractable likelihood in the acceptance probability (18) by the (un¬ 
biased) estimate pe{yi-.T)- What is maybe less intuitive is that this simple idea 
does in fact result in a valid (in the sense that it has p{6 \ yi-x) as its stationary 
distribution) MH algorithm, for any number of particles TV > 1. Let us now 
sketch why this is the case. 

We start by introducing a (high-dimensional) auxiliary variable u consti¬ 
tuted by all the random quantities generated by the PF, i.e. u = ci 2 :t} 


distributed according to ipg{u \ yi-.r) defined in (39). Note that the joint distri¬ 
bution of the parameters 6 and the auxiliary variables u, 


p{d, u I yi-r) = 'ip0{u\ yi,T)p{d \ yi.r) 

_ peiyi-.T)'ipeiu \ yi,T)T^{0) 

P{yi:T) 


(47a) 

(47b) 


has the original target distribution p{9 \ yi-.r) as one of its marginals. Inspired 
by (47b I, consider the following extended target distribution 


(l)i^,u\yi:T) = 


Pe,uiyi-T)i’e{u I yi:T)7r(6>) 
p{yi:T) 


(48) 


where we have made use of the unbiased likelihood estimate pe^uiyi-.T) from 
the PF (and indicate explicitly the dependence on u in the notation for clar¬ 
ity). We can now set up a standard MH algorithm that operates in the (huge) 
non-standard extended space 0 x x {1, ..., approximating the 

extended target distribution (48). The resulting algorithm will generate samples 


asymptotically from pl6 \ yi-.r) despite the fact that we employ an approximate 
likelihood in (48)! To understand why this is the case, let us marginalise (48) 


w.r.t. the auxiliary variable u: 


r 

U I yi:T)du = J Pe,u{yi:T)tp9{u I yi-.T)du. (49) 

The fact that the likelihood estimate pe,u{yi-.T) produced by the PF is unbiased 
means that 


^u\e[pe,u{yi-.T)]= / Pe^u{yi-.T)^e{u\yi-.T)diU = pe{yi,T)- 


(50) 



















Algorithm 2 Particle Metropolis Hastings (PMH) for Bayesian system identifica¬ 
tion of nonlinear SSMs 


1 

Run a PF (Algorithm targeting p{x\-.T \ fi')!]) to obtain u' ^ 
PB[i],u'{yi-.T) according to (|4^. 

" ^e[r](w|j/i:T) and 

2 

for m = 1 to M do 


3 

Sample 9' ~ g(- | 9\m\). 


4 

Run a PF (Algorithm targeting p{xi:T \ 9') to obtain u' 
Pe',u'{yi:T) according to (|42|. 

~ -!/’e'(w yr.r) and 

5 

Sample dm ~ W[0,1]. 


6 

Compute the acceptance probability a by (52l. 


7 

if dm < at then 


8 

6l[m+ 1] 9' and pe[m+i]( 2 /i;T) ^Pe'{yi-.T). 


9 

else 


10 

9[m + 1] ^ 9[m] and Pe[m+i]{yi-.T) ^ Pe[m]{yi-.T). 


11 

end if 


12 

end for 



The marginalisation in (491 can now be finalised, resulting in f u \ yi-T)<iu = 
p{(^ I Vi-.t), proving that p(9 \ Di-.t) is recovered exactly as the marginal of the ex¬ 
tended target distribution (48), despite the fact that we employed a PF approxi¬ 
mation of the likelihood using a finite number of particles N. This explains why 
it is sometimes referred to as an exact approximation. An interpretation is that 
using the likelihood estimate from the PF does change the marginal distribution 
w.r.t. u in (47), but it does not change the marginal w.r.t. 9. 

Based on the current sample (9[m],u[m]) a new sample 19',u') is proposed 
according to 


6»'- g(-16»[m]), w'- V’e'(• I 


(51) 


We emphasise that simulation of u' corresponds to running a PF with the model 

probability of accepting 
\],u[m -P 1]) is given by 


parameterised by 9'. The probability of accepting the sample proposed in (51) 
as the next sample {9[m 


a = 1 A 


%',u'{yi:T)T^{d') q[9[m] I 9') 


P0[. 


m\,u\m 


{yi:T)Tr{9[m]) q{9' \ 9[m]) ’ 


(52) 


which was obtained by inserting (48) and (|51[) into (M 


'\m\,u\m 


In practice it is sufli- 
]}m>i! and we do not 
The above develop- 


cient to keep track of the likelihood estimates {pg[. 
need to store the complete auxiliary variable {u[m\}m>i 
ment is summarised in Algorithmj^ It can be further improved by incorporating 
gradient and Hessian information about the posterior into the proposal (51), re¬ 


sulting in more efficient use of the generated particles Dahlin et al. 2015 


The particle Metropolis Hastings algorithm constitutes one member of the 
particle MCMC (PMCMC) family of algorithms introduced in the seminal paper 
by Andrieu et al. |2010|. The derivation above is along the lines of the pseudo¬ 


marginal approach due to Andrieu and Roberts 2009 . The extended target 


construction (p, however, is the core of all PMCMC methods and they differ in 




























that different (more or less standard) MCMC samplers are used for this (non¬ 
standard) target distribution. They also have in common that SMC is used as 
a proposal mechanism on the space of state trajectories X^. 

-Example 7: PMH applied to the NL-SSM- 

We make use of Algorithm to estimate the parameters in ([^ together with a 
simple Gaussian random walk, 

e' - q{- 1 9[m]) = 7V(0[to], 2.5622^/2), 


where E denotes an estimate of the posterior covariance matrix. This choice is 


optimal for some target distributions as is discussed by Sherlock et al. 
The posterior covariance estimate is obtained as 


2013 


E = 10"'^ 


22.51 

-4.53 


-4.53 

2.57 


using a pilot run of the algorithm. In the upper part of Figure we present the 
resulting marginal posterior estimates. The posterior means 0 pmh = {0.95,51.05} 
are indicated by dotted lines. 

I_I 


7 Data augmentation in nonlinear SSM 

Algorithms implementing the data augmentation strategy treats the states as 
auxiliary variables that are estimated along with the parameters, rather than 
integrating them out. Intuitively this results in algorithms that alterate between 
updating 0 and Xi^t- 


7.1 Expectation maximisation 


The expectation maximisation algorithm introduced in Section [4.1| separates the 
maximum likelihood problem Q into two closely linked problems, namely the 
computation of the intermediate quantity Q{0,0[k]) and its maximisation w.r.t. 
0. As previously discussed, computing the intermediate quantity corresponds 
to solving a smoothing problem. Hence, for a nonlinear/non-Gaussian SSM, a 
natural idea is to use a particle smoother, as discussed in Section |5.3[ to solve 


this subproblem. The details of the algorithm are provided by |Cappe et al. 


2005 , Olsson et al. 2008 , Schon et al. 2011 , whereas the general idea of 


making use of Monte Carlo integration to approximate the E-step dates back to 


Wei and Tanner 1990 


By this approach, a completely new set of simulated particles has to be 
generated at each iteration of the algorithm, since we continuously update the 
value of 0. Once an approximation of Q{0, 0[k]) has been computed, the current 
particles are discarded and an entirely new set has to be generated at the next 
iteration. While it does indeed result in a working algorithm it makes for an 

























marginal posterior estimate marginal posterior estimate 



Figure 2: The marginal posterior estimates for (f> (left) and r (right) using the PMH 
algorithm (upper) and the PGAS algorithm (lower). The dotted vertical and the dark 
grey lines indicate the estimated posterior mean and the prior densities, respectively. 

















inefficient use of the particles. The PMCMC family of algorithms opens up for 
the construction of Markov kernels that can be used to generate samples of the 
state trajectory (to be used in the approximation of Q{9,9[k])) in a computa¬ 
tionally more efficient fashion, which serves as one (of several) motivation of the 
subsequent development. 


7.2 Sampling state trajectories using Markov kernels 

We now introduce another member of the PMCMC family of algorithms (recall 
PMH from Section 6.2) that can be used whenever we are faced with the prob¬ 


lem of sampling from an intractable joint smoothing distribution pe{xi:T \ Vi-.t)- 
In those situations an exact sample can be replaced with a draw from an MCMC 
kernel with stationary distribution pg{xi.T \ Vi-.t), without introducing any sys¬ 
tematic error, and PMCMC opens up for using SMC to construct such MCMC 
kernels. 

Here, we review a method denoted as particle Gibbs with ancestor sampling 
(PGAS), introduced by Lindsten et al. 2014 . To construct the aforementioned 


Markov kernel, PGAS makes use of a procedure reminiscent of the PF in Algo¬ 
rithm [2 The only difference is that in PGAS we condition on the event that 
an a priori specified state x'^ is always present in the particle system, for each 
time t. Hence, the states (a;j, ..., x'j<) must be retained throughout the sam¬ 
pling prodecure. To accomplish this we sample x\ according to the bootstrap 
PF only for i = 1, ..., A^ — 1. The remaining M*' particle x^ is then set deter¬ 
ministically as xf = x^. It is often the case that we are interested in complete 


particles trajectories; cf., (44). To generate a genealogical trajectory for 
it is possible to connect it to one of the particles at time t — 1, {xl_i}fLi by 
sampling a value for the corresponding ancestor index from its conditional 
distribution. This is referred to as ancestor sampling, see Algorithm]^ 

Note that Algorithm]^ takes as input a state trajectory x'^.j, = [x'^, ..., x'j^) 
and returns another state trajectory x\.rp, which is simulated randomly accord- 


Algorithm 3 PGAS kernel (with a bootstrap PF) 

1: Initialisation {t = 1): Draw x\ ~ g{xi) for i = 1, ..., A — 1 and set = x'l. 

Compute wl — ge{yt \ xl) for i = 1, ..., N. 

2: for t = 2 to T do 

3: Sample al with P (aj = j) = 'wi_i for i = 1, ..., A — 1. 

4: Sample xl ~ fe{xt \ x^ti) for i = 1, ..., A — 1. 

5: Set = x't- 

6: Draw with P (a)'^ = j) oc \ ®t_i). 

7: Set x\:t = {xil^_i, xl} for i = 1, ..., A. 

8: Compute wl = ge{yt \ xl) for i = 1, ..., A. 

9: end for 

10: Draw k with ¥ [k = i) oc Wj.. 

11: Return xl-^ = x\,q^. 














ing to some distribution (which, however, cannot be written on closed form). 
Hence, we can view Algorithm as sampling from a Markov kernel defined on 
the space of state trajectories X . This Markov kernel is referred to as the PGAS 
kernel. The usefulness of the method comes from the fact that the PGAS kernel 
is a valid MGMG kernel for the joint smoothing distribution pe{xi.,T \ Ui-.t) for 
any number of particles A^ > 2! A detailed derivation is provided by |Lindsten| 
et al. 2014 , who show that the PGAS kernel is ergodic and that it admits the 


joint smoothing distribution as its unique stationary distribution. This implies 
that the state trajectories generated by PGAS can be used as samples from the 
joint smoothing distribution. Hence, the method is indeed an interesting alter¬ 
native to other particle smoothers. Moreover, the PGAS kernel can be used as 
a component in any (standard) MGMC method. In the subsequent section we 
will make explicit use of this, both for ML and Bayesian identification. 


8 Identification using Markov kernels 

8.1 Expectation maximisation revisited 

In Section [7.1| we made use of particle smoothers to approximate the intractable 
integral defining the intermediate quantity Q{d,0[m\). However, it is possible 
to make more efficient use of the simulated variables by using the PGAS Algo¬ 
rithm 1^ and employing a stochastic approximation update of the intermediate 
quantity Q, 


N 

Qk{0) = (1 - ak)Qk-i{0) + ak'^wtp\ogpg{x\.T,yi:T), (53) 

2 = 1 


where Uk is the step size and {wtp, x\.j,}f^^ is generated by Algorit hm]^ Stochas- 


tic approximation EM (SAEM) was introduced and analysed by Delyon et al. 


1999 and it was later realised that it is possible to use MGMG kernels within 


SAEM Andrieu et al.[ |2005| (see also [Benveniste et al.[ [1990| ). The afore¬ 
mentioned particle SAEM algorithm for nonlinear system identification was 
presented by Lindsten 2013| and it is summarised in Algorithm]^ 


Algorithm 4 PGAS for ML sys. id. of nonlinear SSMs 

1: Initialisation: Set 6'[0] and arbitrarily. Set Qo = 0. 

2: for A: > 1 do 

3: Run Algorithmwith = X\-T\k — 1]. Set xi-.T[k] = x\.rp. 

4: Compute Qk{9) according to ( [53| ). 

5: Compute 9[k\ = argmax Qk{9). 

6: if convergence criterion is met then 

7: retnrn 9\k] 

8: end if 

9: end for 




























Algorithm 5 PGAS for Bayesian sys. id. of nonlinear SSMs 
1: Initialisation: Set 6'[0] and 2 :i:t[ 0] arbitrarily. 

2: for m = 1 to M do 

3: Run Algorithm|^conditionally on {x\-T['ni — 1], and set 

4: Draw 0[m] ~ p(6 I a;i;T[ni], 

5: end for 


Note the important difference between the SMC-based EM algorithm out¬ 
lined in Section 7.1 and Algorithm In the former we generate a completely 
new set of particles at each iteration, whereas in particle SAEM all simulated 
particles contribute, but they are down-weighted using a forgetting factor given 
by the step size. This approach is more efficient in practice, since we can use 
much fewer particles at each iteration. In fact, the method can be shown to 
converge to a maximiser of ([^ even when using a fixed number of particles 
N > 2 when executing Algorithm 


8.2 Bayesian identification 

Gibbs sampling can be used to simulate from the posterior distribution ^ or 
more generally, the joint state and parameter posterior p{0,Xi.t Ivi-.t)- The 
PGAS kernel allows us to sample the complete state trajectory xi-t in one 
block. Due to the invariance and ergodicity properties of the PGAS kernel, the 
validity of the Gibbs sampler is not violated. We summarise the procedure in 
Algorithm 

-Example 8: PGAS applied to Q - 

To make use of Algorithm to estimate the parameters in ([^, we need to 
simulate from the conditional distribution d[m] ^ p(0 | a:i:T[w], This 

distribution is not available in closed form, however we can generate samples 
from it by using rejection sampling with the following instrumental distribution 


q{(t>,T\xi,T[m],yi:T) =g {T-a,/3)Af , 

T- 1 

a = 0.01-k , 


/3 = 0.01 -k ^ - 

- _ xt+i[m]xt[m] 

A —1 r 12 ^ 

J2t=2 


1 xt+i[m]xt[m]j 

2 

T-1 

f = T ^ xt[mf. 


In the lower part of Eigure we present the resulting marginal posterior esti¬ 
mates. The posterior means 0 pg = {0.953,44.37} are indicated by dotted lines. 

I_I 













9 Future challenges 


We end this tutorial by pointing out directions for future research involving 
interesting challenges where we believe that SMC can open up for significant 
developments. 

Over two decades ago the SMC development started by providing a solution 
to the intractable filtering problem inherent in the nonlinear SSM. We have since 
then seen that SMC in indeed much more widely applicable and we strongly 
believe that this development will continue, quite possibly at a higher pace. 
This development opens up entirely new arenas where we can use SMC to solve 
hard inference problems. To give a few concrete examples of this we have the 
Bayesian nonparametric models (such as the Dirichlet and the Beta processes) 
that are extensively used in machine learning. There are also the so-called 
spatio-temporal models, which do not only have structure in time, but also 
in a spatial dimension, imagine the weather forecasting problem. A known 
deficiency of the standard (bootstrap) particle filter is its inability to handle 
high-dimensional variables Xt [Bickel et al. 2008 , which is usually the case in 
for example spatio-temporal models. However, some very recent work has shown 
promising directions to tackle high-dimensional models in a consistent way using 


SMC Naesseth et al. 2014, Beskos et al.[ 2014[ Naesseth et al., 2015 . 

There is a well-known (but underutilised) duality between the control prob¬ 
lem and the model learning problem. Coupled with the various SMC based 
approximations this opens up for fundamentally new controllers to be learnt by 
formulating the policy optimisation in control problems as an inference problem. 


For some work along this direction, see e.g. Doucet et al., 2010, Hoffman et al. 


2009 Toussaint and Storkey, 2006 


The PMCMC family of methods that have been discussed and used through¬ 
out this tutorial is a concrete example of another interesting trend, namely that 
of coupling various sampling methods into more powerful solutions. This is a 
trend that will continue to evolve. The online (Bayesian) inference problem is 
also a future challenge where we believe that SMC will play an important role. 


A Implementation details 

This appendix provides additional implementation details and clarifications 
about the numerical illustrations given in the paper. 

A.l Linear Gaussian state space model 

The LGSS model studied in Example [l] is given by: 

xt+i = 0.7xt-\-vt, ut ~ A/'(0,6*“^), (54a) 

yt = Xt-\-et, et ~ A/'(0,0.1), (54b) 

(6»~ 0(0.01,0.01)), (54c) 




























where the unknown parameter 0 corresponds to the precision of the process 
noise Vt (i.e., 6~^ is the process noise variance). Note that the prior for the 
Bayesian model is chosen as the Gamma (fj) distribution with know parameters 
for reasons of simplicity (it provides positive realizations and it is the conjugate 
prior). Specifically, Q{a^ b) denotes a Gamma distribution with shape a and rate 
b such that the mean is a/b: 


GiO;a,b) = 


b^Qa-i exp(-66») 

m ■ 


(55) 


The state process is assumed to be stationary. This implies that the distribution 
of the initial state (i.e. the state at time t = 1) is given by, 


p{xi I e) = Af(xi; 0, {(1 - 0.7^)e}-^) = {0.516»}-i). 


Identification of 9 is based on a simulated data set consisting of T = 100 samples 
2 / 1:100 with true parameter 00 = 1. 


A. 1.1 Marginalization 

Direct optimization The log-likelihood for the LGSS model is given by 

T T 

V{0) = logpe(j/i:T) = log]^pe(2/t |yi:t-i) = '^\ogpe{yt \ yi-.t-i) 

T 

= '^^ogAf{yt]Xt\t-i, Pt\t-i +0.1) 


4=1 
T r 




= E 


- olog27r- - logAt - ^{yt-xt\t-i) 


T 

= --log27r - 


log At + j-iyt - xt\t-i)" 


(56) 


where X 4 |t-i is the optimal predictor and Tt| 4 -i is the covariance of the pre¬ 
diction error. These quantities can be computed by the Kalman filter via the 
following recursions: 


At = Ft 1 4-1+0.1 (57a) 

Kt=0.7Pt\t-iK" (57b) 

xt+i 1 4 = 0.7xt I t-i + Kt{yt - xt I t-i) (57c) 

Pt+i 14 = 0.49Pt 14-1 + 9-^ - 0.7KtPt I t-i (57d) 


initialized with xi | o = 0 and Fi | o = (0.516*) the mean and covariance of xi, 
the state at time 6=1. 



The gradient of the objective function becomes 


T r 




dO 


where 


T 




=-5i: 


1 dA 


At de At^^‘ xt\t-i) 


dxt\t-i 


de 


1 / ^ ^2 dAt 


de 


dPt I t-i 

de ' 


In order to compute the gradient, we need to compute ■^Xt\t-i and -^Pt\t-i- 
This can be done recursively by differentiating (571 with respect to 0. We get 


dKt 

de 

d^t+i 1 1 

de 

dPt+i 1 1 

de 


0.7 


1 - 


ait-i\ dPt\t-i 


At 


de 


- {0.7- Kt)—— -h {yt - xt\t-i)^^, 


de 


= (0.49-0.7K.)^^-^-0.7P.,_,^. 


(58a) 

(58b) 

(58c) 


For a more complete treatment of this problem, see 


Astrom 


1980 


Metropolis Hastings The first task in setting up an MH algorithm to com¬ 
pute p{e I yi-.r) is to decide on which proposal distribution to use. For simplicity, 
let us make use of a random walk 

q{e'\e[m])=Af{e'\e[m],o.i). ( 59 ) 

Now that we can propose new samples according to ( |5^ , the probability of 
accepting the new sample e' has to be computed. The prior 7r(-) and the proposal 


q{-) are given by (54c) and (59), respectively. Hence, it remains to compute the 


likelihood. The log-likelihood, denoted by V{e) (to agree with the previous 
section), is given by (56) and it can be computed by running the Kalman filter 
(57). The resulting expression for the acceptance probability is thus given by 


a = 1 A 


= 1 A 


Pe'{yi-T)T^{e')q{e[m] \ O') 
Pe[-m]{yi-.T)'x{e[m\)q{e' I e[m]) 
P0'{yi-.T)'x{e') 


P0[m\{yi-.T)'x{e[m]) 

= 1 A exp(F(0') — V{e[m]) — 0.99log(0'/0[rn]) — 0.01(0' — 0[m])), 


(60) 


where the first equality follows from the fact that the random walk proposal 


(59) is symmetric in 0' and 0[m], and the second equality follows from (55) 
and (56). Note also that the prior tt is only supported on positive values on 0, 


so if a negative value is proposed it is automatically rejected (a = 0). 

































A.1.2 Data augmentation 


Expectation M 2 Lxmimisation In the problem setting, xi:t act as the latent 
variables. In order to apply the EM algorithm, we need to calculate the following 
surrogate function 


Q(6», e[k]) = Ee[fe] \ogpe{.xi,T, yi-.r)- 


(61) 


Expanding the right hand side of Eq. (611 gives that 

/ t=T 

Q{e,6[k]) =Eg[k]ilogpeixi)p(yi\xi) Y\_Pe{xt\xt-i)piyt\xt) 

\ t=2 

/ T T \ 

= Ee[fc] I logpg{xi) +'^\ogp9{xt\xt-i) + ^\ogp{yt\xt) j 

\ t=2 t=l / 


In the following, we will drop the terms which are independent of 9 and use the 
linearity of the expectation operator, which gives that 


Qi0,9[k]) = -{T\ogi9)-9 


T-l 


0.51Ee[fc] (xl) + V Egffc] {{xt+i - O.lxtf) 


+ const. 


We see that we need to compute expectations w.r.t. the smoothing distribu¬ 
tion (for the model parameterised by which can be done by running any 

convenient Kalman smoother. 

Next, the M step amounts to maximising Q{9,9[k]) with respect to 9. In 
our case, this maximisation has a closed form solution, given by 


9[k + 1] 


T 

0.51Ee[fe] {x\) + Ee[fc] ((st+i - Q.lxtY) 


(62) 


Gibbs The Gibbs sampler iterates between simulating xi.,t i^ompg{xi.T \ yi-.r) 
and 9 from p{9 \xi:T,yi-.T)- Simulating Xi-,t is done by backward sampling as 
shown in Eq. ([^. The filtering densities pe{xt\yi-.t) = N{xt\xt\t, Pt\t) are 


computed by running a Kalman filter. We then obtain the following expression 
for the backward kernel: 


Pe{xt I yi:t,xt+i) = N{xt \ pt, E*), 


with 


— Xt\t + + 0-49P£|iJ — 0.7xt\t)i 

E, = - 0.49P2^ Q+0.49Pqt^ 


-1 







As for the conditional distribution of 0: due to the fact that the Gamma distri¬ 
bution is the conjugate prior for the precision in a Gaussian model, we obtain 
a closed form expression for, 


T-l 


p{9 I Xi-.T, Vi-.t) = p{d I xi:t) oc p{xi.,T I 0)p{0) = p{0)p{xi I 9) p{xt+i \ xt,9) 


oc 0“ ^ exp(—60)-\/0exp ^ — 


0.516» 


T-l 


v^exp ( - -(xt+i - O.Txty 


T-l 


= 0 “+2 ^ exp [ - f 6 -h ^^xf + ^ X! ^^*+1 “ 


t=i 


oc 9\a+ — 


T-l 


0.51x1 


^{xt+i - O.Txt)' 


Note that we have use proportionality, rather than equality, in several of the 
steps above. However, since we know that p{9\xi:T,yi:T) is a PDF (i.e., it 
integrates to one), it is sufficient to obtain an expression which is proportional 
to a Gamma PDF (the penultimate line). By normalisation we then obtain that 
p{9 I Xi.,T, Vi-.t) is indeed given by the Gamma PDF on the final line. 

The above derivation can straightforwardly be generalized to derive a Gibbs 


sampler for a general LGSS model, the details are provided by Wills et al. 2012 


A.2 Nonlinear example 

Examplej^is borrowed from |Shumway and Stoffer [2011] (see pages 63, 131, 151, 
270, 280). Gonsider a data set consisting of 634 measurements of the thickness 
of ice varves (the layers of clay collected in glaciers) formed at a location in 
Massachusetts between years 9, 883 and 9, 250 BC. The data is modelled using 
a nonlinear state space model given by. 


xt+i\xt M{xt+i\(j)Xt,T ^), (63a) 

yt\xt ~ f/(j/t;6.25,0.256exp(-a;t)), (63b) 


with the parameters 9 = (0, r)^. The system is assumed to be stable and the 
state process stationary. This implies that the distribution of the initial state 
(i.e. the state at time t = 1) is given by, 

p{xi \9) =A/'(a;i;0, {(1 -(()^)t}"1). 


In the Bayesian setting, we use a uniform prior for </> to reflect the stability 
assumption, and a conjugate Gamma prior for t: 

p(r) =^l(r; 0.01,0.01). 


Identification of 9 is based on the measured data set consisting of T = 634 
samples 2/i:634- 









Algorithm 6 Gradient-based maximum likelihood inference in NL-SSMs 

Inputs: K (no. iterations), yi-T (data), Oq (initial parameter), 7 and a (step length sequence). 
Outputs: 9 (est. of parameter). 


1: Initialise the parameter estimate 60 = 9o and set A: = 1. 

2: while k < N ov until convergence do 

3: Run the FFBSi smoother at to obtain Vg logpg(yi;j’). 

4: Apply the update 9k = 9k-i + 7 ■ k~°'SJe logpe(j/i;T). 

5: Set k = k + 1. 

6: end while 
7: Set 9 = 9k- 


A.2.1 Marginalization 


Direct optimization — particle based gradient ascent For this imple¬ 
mentation, we make use of the approach from Poyiadjis et al. 2011| , which 
is summarised in Algorithm To improve the numerical performance, the 


inference is carried out over the transformed parameters (j) = tanh” (0) and 
f = log(r). Hence, the two parameters are now unconstrained and these types 
of transformations can often result in beneficial variance reduction. 

The gradient of the log-likelihood Vg logpg(i/i:T)|e=efc_i estimated by the 
Fisher identity using the fast forward-filtering backward-smoother (FFBSi) with 
early stopping as discussed by Taghavi et al. We make use of 500 forward 

particles, 100 backward trajectories and rejection sampling for 75 trajectories. 
For the Fisher identity, we require calculating the gradients of the complete data 
log-likelihood (with respect to the transformed parameters). Note first that the 
complete data log-likelihood is given by 


T-l 


\ogPe{xi,T, Vi-.t) = logpe(xi) + ^ \ogpe{xt+i \ xt) + const. 


T-l 


= F \ log((l - </'^)r) - (1 - (t>^)Txl + (logr - T{xt+1 - (jjxtf) 




const. 

(64) 


We thus get. 


^logpg{xi,T,yi-.T) = ^{logpe{xi,T,yi-.T)} [ ^ 


T-l 


= -(/) -f (1 - (/)^)t -f ^ Xt{xt+1 - (l)Xt) I , 
where we have used the fact that Jr tanh“^((()) = (1 — Furthermore, we 


















have. 


■^\ogpg{xi..T,yi-.T) = ^ {logpe(a;i:T,yi:T)} 


= 4>'^)xl - r ^ {xt+i - 4>xt 


T-l 


The optimisation is initialised in (untransformed) parameters {4>, t} = {0.95,10} 
with a = —2/3, 7 = 0.01 and runs for K = 250 iterations. 

Metropolis Hastings — PMH The sampler is implemented in two steps. 


In the Hrst step the smooth particle filter Malik and Pitt 2011 is used with 


500 particles to get an initialisation of the parameters and to estimate the 
Hessian of the log-likelihood. The optimisation of the log-likelihood is done 
using a bounded limited-memory BFGS optimizer and the Hessian is estimated 
numerically using a central finite difference scheme. The resulting estimates of 
the parameters and inverse Hessian are 


0ML = {0.95,0.02} X(0ml) = 1O 


-5 


9.30 

2.96 


2.96 

1.99 


The PMHO algorithm is initialised in Oml and makes use of the bootstrap par¬ 
ticle filter with 1 000 particles to estimate the log-likelihood. The proposal is 
selected using the rules-of-thumb in Sherlock et al. 1 ( 2 ^ as 


q{0"\e') = U{e"-e\ (2.562V2)i(0ml)). 

We use 15 000 iterations (discarding the first 2 000 iterations as burn-in) to 
estimate the posteriors and their means. 


A.2.2 Data augmentation 

Expectation Maximisation — PSAEM We outline the implementation de¬ 


tails for the Particle SAEM algorithm (see Section 8.1), but the implementation 


for the PSEM algorithm (see Section 7.1) follows similarly. 


Note that particle SAEM requires us to compute an approximation of the 
Q-function according to ( [5^ . The complete data log-likelihood can be written 
as (see (H)), 

\ogpe{xi.,T,yi-.T) 

= ^ |log((l - 4'^)t) - (1 - (t>^)Txl -f ^ (logr - T{xt+1 - (/a^t)^)! + const. 

If we define the complete data sufficient statistics as S := (4*, $, E, A)^ € 
with 
















we can thus write \ogpg{xi-T,yi-.T) = —0.5/(0;5) + const., where the function 
/ is defined as: 

f{9]S) := - log((l - 4>^)t) +X{1- 

+ (r-l){-logr + r($-2^'<() + </^S)}. (65) 


Expressing the complete data log-likelihood in terms of its sufficient statistics in 
this way is useful, since it allows us to write the approximation of the Q-function 
in (531 as: 


Qk{9) = -O.5/(0; Sk) + const.. 


where Sk = ('1'^, S^, is a stochastic approximation of the sufficient 

statistics, computed recursively as 


T-l / N 

$, = (1 - ak)^k-i + ^ E [Y.^T[k]xU^[k]x\[k] 

T / N 

$, = (1 - ak)^k-i + ^ E E ^T{k]{xl[k]f 

T-l / N 

Efc = (1 - ak)tk-, + ^ E [Y.^Tmx\[k]f 

^ t=l \i^l 
N 

Xk = {1 - ak)Xk-i + ak'^w"j,[k]{x\[k])'^, 

i=l 




where {x\.rp[k],w^[k]}fL-^ are the particle trajectories generated by the EGAS 
algorithm at iteration k. 

Maximising Qk (0) in the M-step of the algorithm is thus equivalent to min¬ 
imising f{d;Sk)- Let us therefore turn to the problem of minimising /(0;5) for 
an arbitrary (but fixed) value of the sufficient statistics S. First, noting that the 
leading two terms in (651 originate from the initial condition, which should have 


a negligible effect on the maximising argument for large T, a good initialisation 
for the maximisation can be obtained by approximating 

/(0; 5) « (T - 1) {- logr + t($ - 24-<^ + . 


Indeed, minimising this approximation can be done on closed form, suggesting 
that 


Topt. « ($-4'VS)-^ 


This provides us with a good initialisation for a numerical optimisation method 
which can be used to minimise f{6;S) to desired precision. 







PGAS In the PGAS algorithm for Bayesian inference we employ a Gibbs 
sampler, iteratively simulating xi-,t from the PGAS kernel, and 9 from the 
conditional distribution p{9 \ Vi-.t)- This distribution is given by 


p(<^,r|xnr,J/i:T) = (66) 

where the constants are given as follows: 


b = 



t=i 


{T,Li Xt+lXt] 


T-1 




- Et=l Xt+iXt 

^ ~ v^T-l 2 ' 

Et=2 Xf 


(67) 

( 68 ) 
(69) 


Simulating from ( 66 ) is done by rejection sampling with an instrumental distri¬ 
bution, 

t\xi,t, Vi-.t) = 0 f t; a -f -y-. H {(j)-, p., t"^) . (70) 


Specifically, we propose a draw {(f)' ,t') from the instrumental distribution and 
accept this as a draw from ( 66 ) with probability l{| 0 |<i}\/l ~ 4’’^- 
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